Objective
Cluster high-flow event characteristics and antecedent watershed conditions to evaluate how these factors converge as flux regimes (clusters) to produce variability in:
1. Event NO3 yields
2. Event SRP yields
3. Event turbidity yields
4. Event NO3:SRP yield ratios
Select variables to keep
Per K Underwood: if two variables are strongly correlated (negatively or positively), they can effectively “double-weight” a factor that drives clustering; thus, keep just one of the variables as a proxy for that factor.
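The screening step described above can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the helper names (`pearson`, `flag_correlated_pairs`) and the toy values are made up for the example, and it assumes Pearson correlation with the 70% cutoff applied to the absolute value of r.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def flag_correlated_pairs(columns, cutoff=0.70):
    """Return (name_a, name_b, r) for every pair with |r| above the cutoff."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) > cutoff:  # negative correlations count too
            flagged.append((a, b, round(r, 3)))
    # Strongest correlations first, so the clearest redundancies are reviewed first
    return sorted(flagged, key=lambda t: -abs(t[2]))


# Made-up values for illustration only; not the study data
toy = {
    "q_1d": [1.0, 2.0, 3.0, 4.0, 5.0],
    "q_4d": [1.1, 2.3, 2.9, 4.2, 5.1],
    "API_4d": [5.0, 1.0, 4.0, 2.0, 3.0],
}
for a, b, r in flag_correlated_pairs(toy):
    print(f"{a} vs {b}: r = {r}")
```

Each flagged pair still needs a judgment call about which variable to keep; the sketch only lists the candidates.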

Decisions for eliminating variables w/ correlations >70%
These were used for the 2020-12-10 run:
- These decisions were tough to make and need review
- rain_event_total_mm and API_4d are highly correlated (88%), but represent different things; I’m going to leave both for now, but could try versions w/ one or the other to see if it matters
- Tough decisions that need review (continued)
- q_event_max and q_mm are highly correlated (83.8%); let’s drop q_event_max for now (reason still to be filled in)
- SoilTemp_pre_wet_15cm and VWC_pre_wet_15cm are correlated (72.3%), but they’re fairly different quantities and the correlation is close to the 70% cutoff; will leave both for now and could try versions w/ one or the other to see if it matters
- I feel OK about these decisions, but they should be reviewed as well
- If the 1-d and 4-d values for a variable are highly correlated, use the 4-d value
- gw_1d_allWells and gw_4d_allWells are highly correlated (99.2%); remove gw_1d_allWells
- Trying to find a VWC variable that correlates well with GW level so that I can remove GW level vars (no GW data in 2017)
- When dropping all GW variables, n obs increases from 45 to 51
- gw_4d_allWells is most highly correlated with VWC_pre_wet_30cm (94.5%), followed closely by VWC_pre_wet_15cm (94.2%), though the latter relationship is slightly less linear at higher values
- drop gw_4d_allWells and use VWC as a proxy for GW level
- VWC_pre_wet_15cm and VWC_pre_wet_30cm highly correlated (94.7%)
- I want to keep all 15cm values for soil vars including VWC, so for now we’ll drop VWC_pre_wet_30cm to avoid double-weighting VWC
- MET variables
- airT_1d and airT_4d are highly correlated (90.1%); removing airT_1d
- airT_4d and dewPoint_4d are highly correlated (96.6%), as is dewPoint_1d; removing both dewPoints
- airT_4d and SoilTemp_pre_wet_15cm are highly correlated (92.9%); drop airT_4d b/c we still have diff_airT_soilT
- solarRad_1d and solarRad_4d are highly correlated (74.7%); drop solarRad_1d per rule above
- Q
- q_event_delta & q_event_max are highly correlated (96.3%); q_event_max is more normally distributed, so let’s keep it vs delta
- q_1d and q_4d are highly correlated (86.4%), so, following the rule above, keep q_4d
- Drop q_event_dQRate_cmsPerHr b/c it’s confusing and hopefully q_event_delta or rain intensity will capture this
- Redox
- If redox variables prove not to be important or they are highly correlated with another variable, we can remove them and increase n obs by at least 7
- Rain
- Drop all the rain_Xd vars, b/c API_4d should cover this, though it would be interesting to test how much the number of pre-event days (e.g., 4 days for API) matters
- rain_int_mmPERmin_mean and rain_int_mmPERmin_max are correlated (74.4%); drop _max so we can keep the mean intensity of the rain event
- Stream
- turb_1d proved not to be useful in driving clusters in the SOM, so I removed it
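The 1-d vs. 4-d rule from the list above can be written down as a small sketch. This is an illustrative Python version, not the code used here: the function name `apply_1d_4d_rule` is hypothetical, matching variables by the `_1d`/`_4d` substring is an assumption, and the correlations are the values reported in the list, converted from percentages to fractions.

```python
def apply_1d_4d_rule(variables, pair_corr, cutoff=0.70):
    """Drop the 1-d version of a variable when it correlates with its 4-d version above cutoff."""
    dropped = []
    for name in variables:
        if "_1d" not in name:
            continue
        partner = name.replace("_1d", "_4d")  # assumed naming convention
        r = pair_corr.get(frozenset((name, partner)))
        if partner in variables and r is not None and abs(r) > cutoff:
            dropped.append(name)
    kept = [v for v in variables if v not in dropped]
    return kept, dropped


# Correlations as reported in the list above
reported = {
    frozenset(("gw_1d_allWells", "gw_4d_allWells")): 0.992,
    frozenset(("airT_1d", "airT_4d")): 0.901,
    frozenset(("solarRad_1d", "solarRad_4d")): 0.747,
    frozenset(("q_1d", "q_4d")): 0.864,
}
variables = ["gw_1d_allWells", "gw_4d_allWells", "airT_1d", "airT_4d",
             "solarRad_1d", "solarRad_4d", "q_1d", "q_4d", "API_4d"]
kept, dropped = apply_1d_4d_rule(variables, reported)
print("dropped:", dropped)
print("kept:", kept)
```

This covers only the 1-d/4-d rule; the other decisions in the list (e.g., later dropping gw_4d_allWells or airT_4d in favor of a proxy) are separate judgment calls and are not encoded here.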
Look at correlations again after dropping variables

Self-organizing map (SOM)
Prepare data & set up grid/lattice dimensions
We’re only using complete observations/rows (no NAs in any columns)
According to the heuristic rule from Vesanto 2000, number of grid elements/grid size/nodes = 5 * sqrt(n), where n is the number of observations
To determine the shape of the grid (ratio of columns to rows), we use the ratio of the first two eigenvalues of the input data set, as recommended by Park et al. 2006
## [1] No. of complete observations: 49 out of 76 observations
## [1] No. of Vesanto nodes: 35
## [1] Ratio of columns to rows: 1.8
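As a rough check on the numbers above, the node count and one possible grid shape can be computed as in the sketch below. This is an illustrative Python version, not the hidden code: the function names are hypothetical, the eigenvalue ratio (1.8) is taken as given rather than recomputed, and the rounding choices are assumptions.

```python
from math import sqrt


def vesanto_nodes(n_obs):
    """Heuristic number of SOM nodes: 5 * sqrt(n), rounded to the nearest integer."""
    return round(5 * sqrt(n_obs))


def grid_shape(n_nodes, col_row_ratio):
    """Rows and columns whose product is near n_nodes and whose ratio is near col_row_ratio."""
    # rows * cols = n_nodes and cols / rows = ratio  =>  rows = sqrt(n_nodes / ratio)
    rows = max(1, round(sqrt(n_nodes / col_row_ratio)))
    cols = max(1, round(n_nodes / rows))
    return rows, cols


n_nodes = vesanto_nodes(49)            # 49 complete observations
rows, cols = grid_shape(n_nodes, 1.8)  # column:row ratio reported above
print(n_nodes, rows, cols)             # prints: 35 4 9
```

Because rows and columns must be integers, the resulting grid (4 x 9 = 36 nodes here) only approximates both the target node count and the target ratio.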
Run SOM for a suite of grid/lattice configurations, # of nodes, and # of clusters
Code courtesy of Kristen Underwood (hidden)
## [1] Topology: hexagonal
## [1] Data normalization method used: L2norm
## [1] Weighting method used: noPCA
## [1] No. of iterations: 1500
## [1] alphaCrs used: 0.05
## [1] alphaFin used: 1500
Choose the best SOM run based on non-parametric F-stat and quantization error
We want to maximize npF (ratio of between- to within-cluster variance) and minimize QE (mean distance between each data vector and its best-matching unit)
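The quantization error defined above can be sketched as follows. This is a minimal Python illustration, not the code used for these runs: it assumes Euclidean distance, the function name is hypothetical, and the data and codebook (node weight vectors) are made up. The npF statistic is not reproduced here.

```python
from math import dist  # Python 3.8+


def quantization_error(data, codebook):
    """Mean Euclidean distance between each data vector and its best-matching unit."""
    total = 0.0
    for vec in data:
        # The best-matching unit is the node whose weight vector is closest
        total += min(dist(vec, node) for node in codebook)
    return total / len(data)


# Made-up 2-D example, not the study data
data = [(0.0, 0.0), (1.0, 0.0), (4.0, 3.0)]
codebook = [(0.0, 0.0), (4.0, 0.0)]
print(quantization_error(data, codebook))
```

In this toy case the three nearest-node distances are 0, 1, and 3, so the QE is 4/3.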
Here are the top 33% of runs based on npF
Examine boxplots of independent variables by cluster

Examine how antecedent and event conditions converge to influence N & P flux regimes:
How do our results differ if we choose the 2nd best SOM run?
## [1] The 2nd best run was:
| Run | rows | cols | Nodes | Clusters | npF | QE |
| --- | --- | --- | --- | --- | --- | --- |
| 39 | 4 | 10 | 40 | 5 | 23.62 | 0.083 |
To examine this run in greater detail (e.g., component planes), see the ‘X_SOMplots_site_ … .pdf’

